Kickstarter is a crowdfunding platform that allows a global community to back and fund people’s creative projects and ideas. Kickstarter was launched in 2009. As of 2018, 15 million people have pledged $4 billion, and 150 thousand projects have been successfully funded.
This dataset was downloaded from Kaggle (https://www.kaggle.com/kemical/kickstarter-projects). According to Kaggle, this data was collected in January 2018.
## [1] "There are 378661 rows and 15 fields in the Original Kickstarter Data set."
The columns and their data type are shown below:
## ID name category main_category
## "integer" "factor" "factor" "factor"
## currency deadline goal launched
## "factor" "factor" "numeric" "factor"
## pledged state backers country
## "numeric" "factor" "integer" "factor"
## usd.pledged usd_pledged_real usd_goal_real
## "numeric" "numeric" "numeric"
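The row count and column types above can be produced along the following lines. This is a minimal sketch on a tiny stand-in data frame: the name `ks` and its values are made up for illustration, and with the real file one would presumably start from `read.csv()` on the downloaded CSV instead.

```r
# Tiny stand-in for the Kickstarter data (illustrative values only).
ks <- data.frame(
  name    = c("The Songs of Adelaide & Abullah", "Where is Hank?"),
  goal    = c(1000, 45000),
  backers = c(0L, 3L),
  stringsAsFactors = TRUE  # mimic the factor columns seen in the output above
)

# Size of the data set, then the class of every column.
size_msg  <- paste("There are", nrow(ks), "rows and", ncol(ks), "fields.")
col_types <- sapply(ks, class)

print(size_msg)
print(col_types)
```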
Here are the top 3 rows of the Kickstarter Dataset:
| ID | name | category | main_category | currency | deadline | goal | launched | pledged | state | backers | country | usd.pledged | usd_pledged_real | usd_goal_real |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1000002330 | The Songs of Adelaide & Abullah | Poetry | Publishing | GBP | 2015-10-09 | 1000 | 2015-08-11 12:12:28 | 0 | failed | 0 | GB | 0 | 0 | 1533.95 |
| 1000003930 | Greeting From Earth: ZGAC Arts Capsule For ET | Narrative Film | Film & Video | USD | 2017-11-01 | 30000 | 2017-09-02 04:43:57 | 2421 | failed | 15 | US | 100 | 2421 | 30000.00 |
| 1000004038 | Where is Hank? | Narrative Film | Film & Video | USD | 2013-02-26 | 45000 | 2013-01-12 00:20:50 | 220 | failed | 3 | US | 220 | 220 | 45000.00 |
It looks like there are two similar columns: usd.pledged and usd_pledged_real. After taking a look at the data overview tab on Kaggle, I have decided to use only the usd_pledged_real column, since this column was converted using the more accurate Fixer.io API. Similarly, I will be using usd_goal_real instead of goal. Finally, I will drop the currency column since everything has been converted to USD.
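Dropping the redundant columns can be sketched as follows. The data frame `ks` and the helper name `drop_cols` are hypothetical; only the column names come from the dataset.

```r
# Stand-in data with the redundant columns present (illustrative values).
ks <- data.frame(
  goal             = c(1000, 30000),
  currency         = c("GBP", "USD"),
  usd.pledged      = c(0, 100),
  usd_pledged_real = c(0, 2421),
  usd_goal_real    = c(1533.95, 30000),
  stringsAsFactors = FALSE
)

# Keep every column except the ones superseded by the *_real fields.
drop_cols <- c("usd.pledged", "goal", "currency")
ks_clean  <- ks[, !(names(ks) %in% drop_cols), drop = FALSE]

print(names(ks_clean))
```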
Next, I need to check if there are any NaNs in the dataset:
## [1] "There are not any NaNs in the dataset: TRUE"
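A check of this kind can be written in one line of base R. The sketch below uses a made-up data frame `ks`; note that `is.na()` flags both `NA` and `NaN` values, so it is slightly broader than a strict NaN test.

```r
# Stand-in data without missing values (illustrative values).
ks <- data.frame(
  backers = c(0L, 15L, 3L),
  pledged = c(0, 2421, 220)
)

# is.na() on a data frame returns a logical matrix; any() collapses it.
no_missing <- !any(is.na(ks))

print(paste("There are not any NaNs in the dataset:", no_missing))
```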
To make life easier in the future, I will convert the deadline and launched columns to dates and datetimes using the lubridate library:
## [1] "Launched and Deadlines have been converted to Dates: TRUE"
I will also be using the launched and deadline fields to create a new field called project_length, which represents the number of days that the project was live.
Finally, I will add a field called pledge_per_backer, which divides a project’s pledge amount by the number of backers.
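The three steps above (date conversion, project_length, and pledge_per_backer) can be sketched as follows. This version uses base R rather than lubridate to stay dependency-free, and it assumes that project_length is the number of whole days between the launch timestamp and midnight of the deadline; that assumption is chosen because it reproduces the day counts in the sample rows, not because the original code is known.

```r
# Three sample rows from the dataset.
ks <- data.frame(
  deadline = c("2015-10-09", "2017-11-01", "2013-02-26"),
  launched = c("2015-08-11 12:12:28", "2017-09-02 04:43:57",
               "2013-01-12 00:20:50"),
  usd_pledged_real = c(0, 2421, 220),
  backers = c(0L, 15L, 3L),
  stringsAsFactors = FALSE
)

# Parse the text columns into a date-time and a date.
launched_dt <- as.POSIXct(ks$launched, tz = "UTC")
deadline_dt <- as.POSIXct(ks$deadline, tz = "UTC")  # midnight of the deadline
ks$launched <- launched_dt
ks$deadline <- as.Date(ks$deadline)

# Whole days the project was live (assumption: fractional days are dropped).
ks$project_length <- floor(as.numeric(difftime(deadline_dt, launched_dt,
                                               units = "days")))

# Plain division: R yields NaN for 0/0 and Inf for a positive pledge with
# 0 backers, so rows with 0 backers need separate handling.
ks$pledge_per_backer <- ks$usd_pledged_real / ks$backers

print(ks[, c("project_length", "pledge_per_backer")])
```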
Here are the first few rows of the cleaned-up dataset:
## [1] "There are 378661 rows and 14 fields in the Cleaned Kickstarter Data set."
| ID | name | category | main_category | deadline | launched | pledged | state | backers | country | usd_pledged_real | usd_goal_real | project_length | pledge_per_backer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1000002330 | The Songs of Adelaide & Abullah | Poetry | Publishing | 2015-10-09 | 2015-08-11 12:12:28 | 0 | failed | 0 | GB | 0 | 1533.95 | 58 days | 0.00000 |
| 1000003930 | Greeting From Earth: ZGAC Arts Capsule For ET | Narrative Film | Film & Video | 2017-11-01 | 2017-09-02 04:43:57 | 2421 | failed | 15 | US | 2421 | 30000.00 | 59 days | 161.40000 |
| 1000004038 | Where is Hank? | Narrative Film | Film & Video | 2013-02-26 | 2013-01-12 00:20:50 | 220 | failed | 3 | US | 220 | 45000.00 | 44 days | 73.33333 |
Here are some quick descriptions of some of the columns:
Univariate analysis is the analysis of a single variable, for example counting the number of projects per category. The following section will focus on this type of analysis. However, at times the process of exploratory data analysis may lead me to do a little bit of bivariate analysis.
Let’s start by getting some summary stats on the columns.
## ID name
## Min. :5.971e+03 New EP/Music Development: 41
## 1st Qu.:5.383e+08 Canceled (Canceled) : 13
## Median :1.075e+09 Music Video : 11
## Mean :1.075e+09 N/A (Canceled) : 11
## 3rd Qu.:1.610e+09 Cancelled (Canceled) : 10
## Max. :2.147e+09 Debut Album : 10
## (Other) :378565
## category main_category deadline
## Product Design: 22314 Film & Video: 63585 Min. :2009-05-03
## Documentary : 16139 Music : 51918 1st Qu.:2013-06-08
## Music : 15727 Publishing : 39874 Median :2015-01-14
## Tabletop Games: 14180 Games : 35231 Mean :2014-11-01
## Shorts : 12357 Technology : 32569 3rd Qu.:2016-04-28
## Video Games : 11830 Design : 30070 Max. :2018-03-03
## (Other) :286114 (Other) :125414
## launched pledged state
## Min. :1970-01-01 01:00:00 Min. : 0 canceled : 38779
## 1st Qu.:2013-05-07 22:14:27 1st Qu.: 30 failed :197719
## Median :2014-12-10 03:23:41 Median : 620 live : 2799
## Mean :2014-09-28 18:06:17 Mean : 9683 successful:133956
## 3rd Qu.:2016-03-24 10:21:09 3rd Qu.: 4076 suspended : 1846
## Max. :2018-01-02 15:02:31 Max. :20338986 undefined : 3562
##
## backers country usd_pledged_real
## Min. : 0.0 US :292627 Min. : 0
## 1st Qu.: 2.0 GB : 33672 1st Qu.: 31
## Median : 12.0 CA : 14756 Median : 624
## Mean : 105.6 AU : 7839 Mean : 9059
## 3rd Qu.: 56.0 DE : 4171 3rd Qu.: 4050
## Max. :219382.0 N,0" : 3797 Max. :20338986
## (Other): 21799
## usd_goal_real project_length pledge_per_backer
## Min. : 0 Length:378661 Min. : 0.00
## 1st Qu.: 2000 Class :difftime 1st Qu.:14.20
## Median : 5500 Mode :numeric Median :40.84
## Mean : 45454 Mean : Inf
## 3rd Qu.: 15500 3rd Qu.:77.22
## Max. :166361391 Max. : Inf
##
To get a feel for what the Kickstarter dataset looks like, I’ll create some quick bar charts that count the number of projects for some of our categorical data.
I’ll first take a look at the number of projects per Main Category.
## [1] "There are 15 different Main Categories."
Film and Video projects are clearly the most popular main category, while Dance is the least popular. Let’s take a look at the projects by category.
## [1] "There are 159 different Categories."
Since there are so many categories, let’s just show the top 15 in the chart.
It looks like Product Design, Documentary, and Music are the top three categories. Let’s get an idea of the distribution between the different states of a project.
There are 6 different states that a project can be in. As expected, ‘failed’ and ‘successful’ are the most common. I’m curious which of the Main Categories have the greatest success rates.
One thing I find interesting is that the 3 Main Categories with the highest success rates are some of the more uncommon types of projects. For example, Dance has the highest success rate but the smallest number of projects. Let’s look at the other side of things and see which Main Categories have the highest failure rate.
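A success rate per main category can be computed as the share of projects whose state is "successful". The sketch below uses made-up rows and assumes that every state (including canceled, live, and so on) counts in the denominator; the original calculation may have used a different denominator.

```r
# Stand-in data (illustrative values only).
ks <- data.frame(
  main_category = c("Dance", "Dance", "Dance", "Food", "Food", "Food", "Food"),
  state = c("successful", "successful", "failed",
            "failed", "failed", "successful", "canceled"),
  stringsAsFactors = FALSE
)

# TRUE/FALSE per project, then the mean of that flag within each category.
is_success   <- ks$state == "successful"
success_rate <- vapply(split(is_success, ks$main_category), mean, numeric(1))
success_rate <- sort(success_rate, decreasing = TRUE)

print(round(success_rate, 2))
```

Swapping `"successful"` for `"failed"` in the comparison gives the corresponding failure rate.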
Journalism, Food, and Crafts have the highest failure rates. Let’s take a look at the number of backers per Main category.
Games have the most backers, so you would think that Games also have a high success rate. However, they are in the middle of the pack when it comes to success rates. There could be multiple reasons for this. One possible reason is that Game backers do not pledge as much money, so even though these projects have many backers they may not reach the monetary goal. Let’s see if this is true.
The average pledge per backer for Games is about $65, which is the 4th lowest of all the Main Categories. I also suspect that games (especially video games) cost a lot to make, so this may be why the success rate for Games is lower as well. Let’s see what the average goal amount is for each Main Category.
Games have the 5th highest Average Goal amount. This combined with the fact that Games also have a low average pledge amount per backer may be the reason why Games have a mediocre success rate relative to the high number of backers that they get.
Next, I want to take a look at the distribution of Pledge and Goal Amounts for all the projects.
The distribution of Pledge amounts is heavily right-skewed. There are around 50,000 projects that have been pledged less than $1000, and then there are outliers that have been pledged a lot more money. Let’s try a log transformation so we can see both sides of the spectrum.
After doing a log transformation, you can see that the distribution is bimodal. The first spike represents all the projects with low Pledge amounts. There is another mode for higher pledge amounts. Let’s take a look at Goal amounts.
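A log transformation of this kind can be sketched as below. The `+ 1` offset is an assumption made here so that projects with a pledge of 0 map to 0 instead of `-Inf`; the original plots may have handled zeros differently. The values are a handful of pledge amounts that appear in the dataset.

```r
# A few pledge amounts spanning the full range, including 0.
usd_pledged_real <- c(0, 1, 220, 2421, 20338986)

# log10(x + 1): zero stays at zero, large outliers are pulled in.
log_pledged <- log10(usd_pledged_real + 1)
print(round(log_pledged, 2))

# Bin counts on the log scale without drawing a plot.
hist_counts <- hist(log_pledged, breaks = 0:8, plot = FALSE)$counts
print(hist_counts)
```

On this scale, a cluster of projects at or near zero and a second cluster of funded projects show up as separate peaks rather than being squeezed into the first bar.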
Similar to the Pledge amount distribution, the Goal amount distribution is right-skewed. However, you can see a pattern where there is a spike in $50,000 increments. This is probably because people enter Goal amounts in these types of increments. Let’s do a log transformation and plot on the same chart as the Pledge distributions.
One interesting thing from this histogram is that a lot of the Pledge amounts are to the left of the Goal amounts, which means many of the projects probably failed to meet their goal. This aligns with what some of the previous charts showed as well.
Lastly, some other factors that I wanted to look at were the spatial and temporal distributions. Let’s take a quick look at which countries had the most projects.
It’s clear that US, Great Britain, and Canada have the most Kickstarter Projects. Let’s quickly look at the number of projects over time.
The number of Kickstarter projects increased from the platform’s beginning in 2009 and peaked in 2015. In 2016-2017, the number of projects slightly decreased. It’s important to note that this dataset was collected in January of 2018, which explains why there are almost no projects for 2018. Let’s see how many projects were created by project length.
The most common project length is 15-30 days by far. The longest project length in this data set is 91 days.
There are 378,661 Kickstarter projects in this dataset and 14 features (ID, name, category, main_category, deadline, launched, pledged, state, backers, country, usd_pledged_real, usd_goal_real, project_length, and pledge_per_backer).
There are 15 main categories and 159 categories. Also, there are 6 different states that a project can be in.
Here are some other observations:
The feature that I’m most interested in is state. More specifically, I’m interested in figuring out what type of projects and specifications are more likely to succeed (aka Success Rate).
I’m going to take a look at category, main_category, project_length, backers, and pledge_per_backer to see if those variables contribute to a project’s success rates.
Yes, I created the project_length and pledge_per_backer variables.
When I visualized the distribution of pledge and goal amounts they were both heavily right-skewed, which makes sense because most projects will have lower pledge and goal amounts. But as expected, there are a few big-budget projects that will need more funding, which creates that right-skewed tail. In an attempt to normalize the distribution, I did a log transformation. The results showed that the pledge amount distribution was bimodal and the goal amount distribution was very close to normally distributed.
It looks like the dataset has some projects that have 0 backers but still have a pledge amount. This creates erroneous data for the pledge_per_backer field, since you cannot divide a number by 0. For the sake of this analysis, I’ll remove these rows from the data set. Note: There are 3,082 of these projects.
I will also be creating a new field called success_state, where a successful state is 1, a failed state is 0, and any other state is -1. Then I’ll remove all the -1 success_states and only leave projects that failed or succeeded.
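Both cleanup steps can be sketched in a few lines of base R. The data frame `ks` and the name `ks_binary` are hypothetical and the rows are made up; only the column names and the 1 / 0 / -1 encoding come from the text.

```r
# Stand-in data (illustrative values only).
ks <- data.frame(
  state = c("failed", "successful", "canceled", "failed", "live"),
  backers = c(15L, 120L, 4L, 0L, 7L),
  usd_pledged_real = c(2421, 9000, 50, 25, 300),
  stringsAsFactors = FALSE
)

# Step 1: drop rows with 0 backers but a positive pledge amount, because
# pledge / 0 evaluates to Inf and distorts pledge_per_backer.
ks <- ks[!(ks$backers == 0 & ks$usd_pledged_real > 0), ]
ks$pledge_per_backer <- ks$usd_pledged_real / ks$backers

# Step 2: encode the outcome as 1 (successful), 0 (failed), -1 (anything else).
ks$success_state <- ifelse(ks$state == "successful", 1,
                           ifelse(ks$state == "failed", 0, -1))

# Keep only the projects that clearly failed or succeeded.
ks_binary <- ks[ks$success_state != -1, ]

print(ks_binary)
```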
Now that we cleaned up some of the data for this bivariate analysis, let’s take a look and see if any of the numerical variables are correlated with each other.
## backers usd_pledged_real usd_goal_real
## backers 1.0000000000 0.752831767 0.007473407
## usd_pledged_real 0.7528317673 1.000000000 0.008786419
## usd_goal_real 0.0074734067 0.008786419 1.000000000
## pledge_per_backer 0.0097912664 0.088040626 0.021194402
## project_length 0.0006882141 0.009901474 0.023015175
## pledge_per_backer project_length
## backers 0.009791266 0.0006882141
## usd_pledged_real 0.088040626 0.0099014742
## usd_goal_real 0.021194402 0.0230151753
## pledge_per_backer 1.000000000 0.0266999789
## project_length 0.026699979 1.0000000000
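A matrix like the one above is what `cor()` returns when it is given a data frame of numeric columns (Pearson correlation by default). The sketch below uses a few made-up rows, so its coefficients will not match the real ones.

```r
# Stand-in numeric columns (illustrative values only).
ks <- data.frame(
  backers          = c(15, 3, 120, 40, 260),
  usd_pledged_real = c(2421, 220, 9000, 3100, 21000),
  usd_goal_real    = c(30000, 45000, 5000, 2500, 15000)
)

# Pairwise Pearson correlations between all numeric columns.
cor_matrix <- cor(ks)

print(round(cor_matrix, 3))
```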
There is a strong positive correlation between backers and pledged amount. This makes sense because with an increase in backers you can expect an increase in the amount of money pledged. None of the other numerical variables appear to have a correlation. Let’s see the relationship between backers and pledge amount.
There is a lot of data in this dataset, so to help with overplotting, I changed the transparency to 20%. I added a linear regression to visualize the positive correlation between backers and pledge amount.
Since the other numerical variables are not correlated with each other, let’s take a look at some relationships between categorical data and numerical data. We’ll start by seeing if there is a relationship between success rate and project length.
Projects that have a length of 0-15 days have the highest success rate (60%). The success rate then slowly declines as projects get longer, until another spike at 60-75 days. Next, I’m curious if there is a relationship between backers and success rate. I predict that projects with more backers are also more likely to succeed. Let’s see if that is true.
It looks like my prediction was correct: as the number of backers increases so does the success rate of the project, which makes intuitive sense. Let’s see how the average pledge per backer relates to success rate.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.45 25.72 48.96 74.86 84.61 10000.00
Interestingly, the success rate peaked at around an average pledge amount of $100-200 and then started to dip after that. I would think that as the average pledge per backer increased, so would the success rate, but this is not the case. I wonder if this is because the number of backers for the higher average pledge per backer groups is lower. Let’s take a quick look.
## # A tibble: 9 x 2
## pledge_backer_grp median_number_backers
## <fct> <dbl>
## 1 (0,10] 1
## 2 (10,25] 5
## 3 (25,50] 23
## 4 (50,100] 42
## 5 (100,150] 48
## 6 (150,200] 45
## 7 (200,500] 36
## 8 (500,1e+03] 20
## 9 (1e+03,1e+04] 9
You can see that the median number of backers for the higher average pledge groups (i.e., (200,500], (500,1e+03], etc.) starts to dip, which is why the success rate dips for these groups as well. Basically, even though the average pledge amount is high in these groups, often there are not enough backers to push these projects to success.
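The grouping used in this table can be reproduced with `cut()`, whose default labels have exactly the `(lower,upper]` form shown. The sketch below uses the same break points as the table but made-up rows, so the medians are only illustrative.

```r
# Stand-in data (illustrative values only).
ks <- data.frame(
  pledge_per_backer = c(5, 8, 20, 40, 45, 120, 130, 700),
  backers           = c(1, 2, 5, 20, 26, 50, 46, 20)
)

# Right-closed bins matching the groups in the table.
breaks <- c(0, 10, 25, 50, 100, 150, 200, 500, 1000, 10000)
ks$pledge_backer_grp <- cut(ks$pledge_per_backer, breaks = breaks)

# Median number of backers within each bin (empty bins are dropped).
median_backers <- aggregate(backers ~ pledge_backer_grp, data = ks,
                            FUN = median)

print(median_backers)
```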
Next, I would like to look at the success rate over time.
From 2009 onward, the success rate of Kickstarter projects declined, hitting rock bottom in 2015. Interestingly, 2015 is also the year that Kickstarter had the most projects. After 2015, the number of projects dipped, but the success rate of projects started increasing.
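A yearly success rate can be derived by extracting the launch year and averaging the 1/0 success_state within each year. The sketch below uses made-up rows and a hypothetical `launch_year` column; it assumes the year is taken from launched rather than from deadline.

```r
# Stand-in data (illustrative values only); success_state is 1 or 0.
ks <- data.frame(
  launched = as.POSIXct(c("2009-06-01 10:00:00", "2009-09-15 08:30:00",
                          "2015-03-02 12:00:00", "2015-07-20 18:45:00",
                          "2015-11-11 09:15:00"), tz = "UTC"),
  success_state = c(1, 1, 0, 0, 1)
)

# Extract the calendar year of the launch timestamp.
ks$launch_year <- as.integer(format(ks$launched, "%Y"))

# The mean of a 1/0 flag is the share of successful projects per year.
rate_by_year <- aggregate(success_state ~ launch_year, data = ks, FUN = mean)

print(rate_by_year)
```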
Up until this point we’ve been measuring projects by their success rates. However, not all successful projects are equally successful; some projects may exceed their goal by a greater margin than others. In order to explore this, we need to create a new variable called pledge_goal_ratio, which divides the pledge amount by the goal amount. We will then take a look at the main categories that have the most projects with a pledge/goal ratio of 10 or more and a goal of $1,000 or more. This eliminates projects that have a goal of 1 dollar and a very high pledge/goal ratio.
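The ratio and the two filters described above can be sketched as follows. The rows are made up and the name `high_ratio` is hypothetical; the thresholds (ratio of 10 or more, goal of $1,000 or more) come from the text.

```r
# Stand-in data (illustrative values only).
ks <- data.frame(
  main_category    = c("Design", "Games", "Games", "Art", "Design"),
  usd_pledged_real = c(250000, 60000, 500, 40, 12000),
  usd_goal_real    = c(20000, 5000, 1000, 1, 10000),
  stringsAsFactors = FALSE
)

# How many times over the goal was pledged.
ks$pledge_goal_ratio <- ks$usd_pledged_real / ks$usd_goal_real

# Keep high-ratio projects, excluding those with trivially small goals.
high_ratio <- ks[ks$pledge_goal_ratio >= 10 & ks$usd_goal_real >= 1000, ]

# Number of high-ratio projects per main category.
print(table(high_ratio$main_category))
```

The Art row has a ratio of 40 but a goal of only $1, so the goal filter removes it.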
The main categories with the most projects with a high pledge/goal ratio are Design and Games. Let’s drill down and see which categories from these top two main categories have the most high-ratio projects.
It looks like Product Design and Tabletop Games have the most projects with a high pledge/goal ratio, by a long shot.
The feature I was most interested in was Success Rate. I found that as the number of backers increases, so does the success rate of a project, which makes intuitive sense. I also found that the projects that have the highest rate of success have an average pledge per backer of $100-200. Projects that have an average pledge per backer in this range normally have the most backers as well.
After producing a correlation matrix, I found that the number of backers has a strong positive correlation with the pledge amount of a project. This makes sense since you would expect more money to be pledged with the more people that back a project.
Another feature I explored was a project’s pledge-to-goal ratio, which measures how far a project surpassed or fell short of its goal. I found that the Design and Games main categories had the most projects that surpassed their goal by a ratio of 10 or more. Furthermore, Product Design and Tabletop Games were the categories that most often surpassed their goal by a ratio of 10 or more.
The strongest relationship I found was the positive correlation between backers and pledge amount. The relationship between the number of backers and success rate was also strong.
Similar to the bivariate analysis, let’s start off by looking at success rate by project length. However, this time let’s add one more variable: main_category. We can see if success rates and project length have different relationships with each other for certain main categories.
It looks like the majority of the main categories have a negative relationship between Success Rate and Project Length. However, there are some exceptions: Food, Journalism, Photography, and Technology do not seem to have a negative relationship.
Note that the Theater projects seem to spike at a Project Length of 90 to 105 days, but the sample size of this bin is only 1, so this point is misleading.
Let’s take a look at Success rate by number of backers and main categories.
First off, you can clearly see that all the main categories have a positive relationship between success rate and number of backers. Another, less obvious, observation is that the success rates of some main categories do not increase as quickly as the number of backers increases. For example, Design, Games, and Technology do not have a success rate above 50% until around 50-100 backers. On the other hand, Dance, Crafts, and Comics all exceed a success rate of 50% by 25-50 backers. One hypothesis I had was that main categories with a lower average goal amount would require fewer backers to be successful. To help visualize this, I ordered the main category panels by their average goal amount in ascending order. You can see that Design, Games, and Technology projects all have a relatively high average goal amount.
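Ordering the panels by average goal amount comes down to reordering the levels of the main_category factor. Here is a minimal sketch using `reorder()` from base R's stats package on made-up rows; it assumes the panels are drawn by a faceting tool such as ggplot2, which lays panels out in factor-level order.

```r
# Stand-in data (illustrative values only).
ks <- data.frame(
  main_category = c("Dance", "Dance", "Technology", "Technology",
                    "Comics", "Comics"),
  usd_goal_real = c(3000, 5000, 80000, 120000, 6000, 10000),
  stringsAsFactors = FALSE
)

# Reorder factor levels by the mean goal of each category (ascending).
ks$main_category <- reorder(factor(ks$main_category), ks$usd_goal_real,
                            FUN = mean)

print(levels(ks$main_category))
```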
Let’s take a look at Success rate over time by main categories.
Here are some quick observations I noticed from this chart:
Art, Crafts, Dance, Food, Music, Journalism, and Technology all took a dip in success rate around 2014 and 2015.
Games, Comics, and Design have increased in success rate from 2016 and on.
Dance and Theater have, in general, always had a fairly high success rate of above 60%, while Fashion has never had a success rate over 50%.
In the bivariate analysis section, we took a look at the number of High Pledge/Goal ratio projects by main category. Let’s now add another variable, time, and see if that gives us further insight.
After adding time to the analysis, we can see that Kickstarter projects rarely exceeded a Pledge/Goal ratio of 10 when the company first started in 2009. Starting around 2012, the number of these projects started to increase for multiple main categories. The most notable main categories are Design, Games, and Technology. The number of high-ratio Design and Games projects has been steadily increasing since 2011.
For this multivariate analysis, I built off of the charts that I created for the bivariate analysis and added a new variable. In most of these cases, that new variable was main_category. This allowed me to look at how different variables affect success rate for each of the main categories. Here are some of the observations that I made:
Most main categories show a negative relationship between success rate and project length. However, some main categories, like Food and Journalism, do not show this negative relationship.
All main categories show that as the number of backers increases, so does the success rate. Interestingly, though, some categories (like Design and Technology) show a slower increase in success rate as the number of backers increases.
Food, Journalism, Photography, and Technology have been declining in success rate since 2009.
The number of high pledge/goal ratio projects for Design, Games, and Technology has been increasing in the past 5 years.
I think the most interesting interaction that I discovered was that, in general, main categories that have a higher average goal amount need more backers to increase their success rate. The “Success Rate by Number of Backers and Main Category” chart shows this clearly.
I chose to revamp this plot because it effectively tells multiple stories. It shows the distribution for Goal and Pledge Amounts and it also shows that many of the projects have a lower pledge amount than the goal amount. To emphasize the latter point, I lowered the opacity of the bars to 60%. I also removed some of the grid lines and background colors to bring the histogram bars into focus. Lastly, I moved the legend to the bottom of the figure to free up some space for the chart.
This plot does a good job of showing the relationship between the number of backers and the success/failure rate. I changed the title of the chart to make the key insight/takeaway clear. Next, I chose to color the success rate green, since this is a common color associated with success. Likewise, I chose red for failure for similar reasons. I moved the legend to the bottom of the figure to make room for the actual stacked bar chart. Finally, I rounded the last grouping to 220,000 so that it is cleaner to read.
I already made some key tweaks to this chart during my multivariate analysis. For example, I ordered the Main Category Panels by their Average Goal Amount in ascending order. I chose to do this in order to reveal that the lines with the flatter slopes (i.e., the categories with a slower increase in success rate relative to an increase in number of backers) are normally the categories that have a higher average goal amount. I also decided to remove the default grey background, which to me does not add any value to the chart. Finally, to remove clutter from the chart, I decided to decrease the number of breaks on the y-axis so that it only shows 0%, 50%, and 100%.
This exploration started even before writing a single line of R code. I wanted to choose a dataset that I was interested in and one that was fairly clean. After spending almost an hour searching the internet for a dataset that inspired me, I finally found this Kickstarter Dataset.
I spent some time making sure the dataset was clean and adding columns that would come in handy for future analysis, but luckily this dataset was pretty easy to work with. Once I started the univariate analysis I found simple insights like: the median number of backers for a project is 12, the Product Design category had the most projects, and more projects failed than succeeded. The latter insight naturally led me to start doing some bivariate analysis, like figuring out which project category had the best success rates, but I tried to stick to doing just univariate analysis for that section.
One interesting insight that I discovered through the univariate analysis section was that Game projects had the most backers of any type of project, but their success rate was just middle of the pack. I found that this may be because the average pledged amount for games is relatively low, while the average goal amount for these projects is relatively high. Basically, game projects have a lot of backers that contribute a small amount of money, and unfortunately the goal amount is often too high for the project to succeed.
Next, I did some bivariate analysis and focused mainly on how different variables relate to the success rate of a project. By far, the strongest relationship that I found was that as the number of backers for a project increases so does the likelihood of success. Another interesting insight I found was that projects that had an average pledge per backer of $100-200 had the highest success rates. If the average pledge per backer was higher than that, the number of backers was often too low to push the project to success.
Lastly, during my multivariate analysis I chose to focus on insights that I looked at during the bivariate analysis while adding main category as a new variable. I think the most interesting observation from this analysis was that for every main category, as the number of backers increased so did the success rate. However, some main categories’ success rates increased more slowly as the number of backers increased. I found that this may be correlated to the average goal amount for the specific main category.
In the future I would like to create some predictive models that could forecast how likely a project is to succeed based on the category, backers, and goal amount. This analysis could be even more insightful if we had demographic data about the backers for each project. For example, if we had the ages of the backers, we could answer something like “Which age group has the highest average pledge amount for Game projects?” This would help the owner of the project determine which age group they should target to raise the most money.